Abstract

In this work, we modify the superparamagnetic clustering algorithm (SPC) by 
adding an extra weight to the interaction formula that considers which genes 
are regulated by the same transcription factor. With this modified algorithm,
which we call SPCTF, we analyze the Spellman et al. microarray data for cell cycle
genes in yeast and find clusters with a higher number of elements than
those obtained with the SPC algorithm. Some of the genes incorporated
by SPCTF were not detected at first by Spellman et al. but were later
identified by other studies, whereas several genes still remain unclassified. The
clusters composed of unidentified genes were analyzed with MUSA (motif
finding using an unsupervised approach), which allowed us to select
the clusters whose elements contain cell cycle transcription factor binding sites
as clusters worthy of further experimental study, because they would probably
lead to new cell cycle genes. Finally, our idea of introducing available
information about transcription factors to optimize the gene classification could be
implemented in other distance-based clustering algorithms.

Keywords: Superparamagnetic clustering, similarity measure, microarrays, 
cell cycle genes, transcription factors. 
1. Introduction



DNA microarrays allow the comparison of the expression levels of all genes in
an organism in a single experiment, which often involves different conditions (e.g.,
health-illness, normal-stress) or different discrete time points (e.g., cell cycle)
[1]. Among other applications, they provide clues about how genes interact
with each other, which genes are part of the same metabolic pathway, or what
the possible role could be for genes without a previously assigned function.
DNA microarrays have also been used to obtain accurate disease classifications
at the molecular level [2, 3, 4]. However, transforming the huge amount of data
produced by microarrays into useful knowledge has proven to be a difficult key
step [5].

On the other hand, clustering techniques have several applications, ranging
from bioinformatics to economics [6, 7]. In particular, data clustering is probably
the most popular unsupervised technique for analyzing microarray data sets
as a first approach. Many algorithms have been proposed, with hierarchical clustering,
k-means and self-organizing maps being the best known [8-11]. Clustering
consists of grouping items together based on a similarity measure in such a way
that elements in a group are more similar to each other than to
elements belonging to different groups. The definition of the similarity measure, which
quantifies the affinity between pairs of elements, introduces a priori information
that determines the clustering solution. Therefore, this similarity measure
could be optimized by taking into account additional data acquired, for example,
from real experiments. Some works that include biological information a priori in
clustering models can be found in [12, 13].



In the case of gene expression clustering, the behavior of the genes reported
by microarray experiments is represented as N points in a D-dimensional space,
where N is the total number of genes and D the number of conditions. Each gene
behavior (or point) is then described by its coordinates (its expression value for
each condition). Genes whose expression patterns are similar will appear closer
in the D-space, a characteristic that is used to classify data into groups. In
our case, we have used the Superparamagnetic Clustering Algorithm (SPC)
[14, 15, 16], which was proposed in 1996 by Domany and collaborators as a
new approach for grouping data sets. However, this methodology has difficulties
dealing with clusters of different density, and in order to ameliorate this, we report
here some modifications of the original algorithm that improve cluster detection.
Our main contribution consists of increasing the similarity measure between
genes by taking advantage of transcription factors, special proteins involved in
the regulation of gene expression.

The present paper is organized as follows: in Section 2, the SPC algorithm 
is introduced, as well as our proposal to include further biological information 
and our considerations for the selection of the most natural clusters. Results for 
a real data set, as well as performance comparisons, are presented in Section 3. 
Finally, Section 4 is dedicated to a summary of our results and conclusions. 
2. Method



2.1. Superparamagnetic Clustering Algorithm (SPC) 

A Potts model can be used to simulate the collective behavior of a set of
interacting sites using a statistical mechanics formalism. In the more general
inhomogeneous Potts model, the sites are placed on an irregular lattice. In
the SPC idea of Domany et al., each gene's expression pattern is represented
as a site in an inhomogeneous Potts model, whose coordinates are given by the
microarray expression values. In this way, a particular lattice arrangement is
spanned for the entire data set being analyzed.

A spin value $\sigma_i$, arbitrarily chosen from q possibilities, is assigned to each
site, where $i = 1, 2, \ldots, N$ labels the sites of the lattice. The main
idea is to characterize the resulting spin configuration by the ferromagnetic
Hamiltonian:

$$ H = -\sum_{\langle i,j \rangle} J_{ij}\, \delta_{\sigma_i,\sigma_j}, \qquad \sigma_i = 1, \ldots, q, \qquad (1) $$

where the sum goes over all neighboring pairs, $\sigma_i$ and $\sigma_j$ are the spin values of site
i and site j respectively, and $J_{ij}$ is their ferromagnetic interaction strength.
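
As a minimal illustration of Eq. (1), the energy of a spin configuration can be evaluated by summing the couplings of the neighboring pairs whose spins coincide. The function name `potts_energy` and the dictionary layout used for the couplings are assumptions made for this sketch; they are not taken from the SPC implementation.

```python
def potts_energy(spins, couplings):
    """Energy H = -sum over neighbor pairs of J_ij * delta(s_i, s_j).

    spins     -- list of spin values, one per site (each in 1..q)
    couplings -- dict mapping a neighboring pair (i, j) to its strength J_ij
                 (hypothetical layout chosen for this example)
    """
    energy = 0.0
    for (i, j), j_ij in couplings.items():
        if spins[i] == spins[j]:  # Kronecker delta: only aligned pairs contribute
            energy -= j_ij
    return energy


# Three sites on a chain: 0-1 strongly coupled, 1-2 weakly coupled.
couplings = {(0, 1): 1.0, (1, 2): 0.25}
aligned = potts_energy([3, 3, 3], couplings)     # all spins equal
partial = potts_energy([3, 3, 1], couplings)     # site 2 differs from its neighbor
disordered = potts_energy([1, 2, 3], couplings)  # no aligned pair
```

The fully aligned configuration has the lowest energy, which is why it dominates at low temperature.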

Each site interacts only with its neighbors; however, since the lattice is
irregular, it is necessary to assign the set of nearest neighbors of each site using
the so-called k-mutual-nearest-neighbor criterion. The original interaction
strength is as follows:

$$ J_{ij} = \begin{cases} \dfrac{1}{K} \exp\left(-\dfrac{d_{ij}^2}{2a^2}\right) & \text{if } i \text{ and } j \text{ are neighbors,} \\ 0 & \text{otherwise,} \end{cases} \qquad (2) $$

with K the average number of neighbors per site and a the average distance
between neighbors. The interaction strength between two neighboring sites
decreases in a Gaussian way with the distance $d_{ij}$, and therefore sites that are
separated by a small distance are more likely to share the same spin value
during the simulation than distant sites. This
probability, $P_{ij} = 1 - \exp\left(-\frac{J_{ij}}{T}\,\delta_{\sigma_i,\sigma_j}\right)$, also depends on the temperature T, which acts
as a control parameter. At low temperatures, the sites tend to have the same
spin values, forming a ferromagnetic system. This configuration is preferred
over others because it minimizes the total energy. However, the probability of
encountering aligned spins diminishes as the temperature increases, and the system
can either experience a single transition to a totally disordered state (paramagnetic
phase), or pass through an intermediate, partially ordered phase,
which is known as the superparamagnetic phase. In the
latter case, various regions of sites sharing the same spin value emerge. Sites
within these regions interact strongly with each other while exhibiting
only weak interactions with sites outside the region. These regions
can fragment into smaller grains, leading to a chain of transitions within the
superparamagnetic phase until the temperature is so high that the system
enters the paramagnetic phase, where each spin behaves independently. This
hierarchical subdivision into magnetic grains reflects the organization of data into
categories and subcategories. Regions of aligned spins emerging during the simulation
correspond to groups of points with similar coordinates, i.e., similar gene
expression patterns [14, 15, 16]. This subdivision can be simulated, for example,
by using the Monte Carlo approach, by which one can compute and follow the
evolution of system properties such as energy, magnetization and susceptibility
while the temperature is modified. In addition, the temperature ranges in which
each phase transition takes place can be located.
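
The neighbor assignment, the Gaussian interaction strength and the temperature-dependent probability described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, distances are taken to be Euclidean, and K and a are computed as plain averages over the mutual-neighbor pairs.

```python
import math


def mutual_knn(points, k):
    """Return the pairs (i, j), i < j, that are mutual k-nearest neighbors."""
    n = len(points)
    nearest = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        nearest.append(set(order[:k]))
    return {(i, j) for i in range(n) for j in nearest[i]
            if i < j and i in nearest[j]}


def gaussian_couplings(points, k):
    """Interaction strengths J_ij for mutual-neighbor pairs (assumes >= 1 pair)."""
    pairs = mutual_knn(points, k)
    dist = {p: math.dist(points[p[0]], points[p[1]]) for p in pairs}
    avg_neighbors = 2 * len(pairs) / len(points)    # K: mean neighbors per site
    avg_distance = sum(dist.values()) / len(pairs)  # a: mean neighbor distance
    return {p: math.exp(-d ** 2 / (2 * avg_distance ** 2)) / avg_neighbors
            for p, d in dist.items()}


def bond_probability(j_ij, temperature):
    """Probability that two aligned neighboring spins stay linked at temperature T."""
    return 1.0 - math.exp(-j_ij / temperature)


# Three nearby points and one distant point, which ends up with no mutual neighbor.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
pairs = mutual_knn(points, k=2)
strengths = gaussian_couplings(points, k=2)
```

Closer pairs receive larger couplings, and for a fixed coupling the probability decreases as the temperature grows, which is what drives the fragmentation into smaller grains.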

Rather than thresholding the distances between pairs of sites to decide their 
assignment to clusters, the pair correlation $G_{ij}$, indicating a collective aspect of
the data distribution, is preferred . It can be calculated as follows [l^ 

$$ G_{ij} = \frac{(q-1)\,P_{ij} + 1}{q}. \qquad (3) $$

In this way, $G_{ij}$ is the normalized probability of finding two Potts spins
$\sigma_i$ and $\sigma_j$ sharing the same value for a given temperature step. If both spins
belong to the same ordered region, their correlation value will be close to one;
otherwise their correlation will be close to zero [17]. Thus, for each temperature
step, two sites are assigned to the same cluster if their correlation exceeds
the threshold, $G_{ij} > 0.5$. If a site does not have a single correlation value
greater than 0.5, it is joined with its neighbor showing the highest value.



2.2. Transcription Factors in SPC (SPCTF) 

For our SPCTF algorithm, we also accept sites whose $G_{ij}$ is larger than
0.5 in order to build a cluster. However, unlike the traditional SPC
algorithm [14, 15, 16, 17], if two sites do not reach a $G_{ij}$ value greater than
0.5 they are not connected. This is because with our data we have found that
the original condition led to unnatural growth of some clusters when the
temperature is increased.
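
The two assignment rules can be contrasted with a small sketch. It is an illustration rather than the authors' code: the function name, the `join_orphans` flag and the toy correlation values are all hypothetical, and clusters are built as connected components with a simple union-find structure.

```python
def clusters_from_correlations(n_sites, correlations, threshold=0.5,
                               join_orphans=True):
    """Group sites whose pair correlation exceeds the threshold.

    correlations -- dict mapping neighbor pairs (i, j) to G_ij
    join_orphans -- True mimics the original SPC rule (a site with no
                    correlation above the threshold is linked to its most
                    correlated neighbor); False mimics the SPCTF rule.
    """
    parent = list(range(n_sites))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    best = {}       # site -> (highest G_ij seen, corresponding neighbor)
    linked = set()  # sites with at least one correlation above the threshold
    for (i, j), g in correlations.items():
        for a, b in ((i, j), (j, i)):
            if a not in best or g > best[a][0]:
                best[a] = (g, b)
        if g > threshold:
            union(i, j)
            linked.update((i, j))

    if join_orphans:
        for site, (_, neighbor) in best.items():
            if site not in linked:
                union(site, neighbor)

    groups = {}
    for site in range(n_sites):
        groups.setdefault(find(site), []).append(site)
    return sorted(groups.values())


# Sites 0-2 are strongly correlated; sites 3 and 4 are only weakly attached.
toy = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.3, (3, 4): 0.2}
spc_like = clusters_from_correlations(5, toy, join_orphans=True)
spctf_like = clusters_from_correlations(5, toy, join_orphans=False)
```

In this toy case the original rule chains the two weakly correlated sites onto the cluster, whereas the stricter rule leaves them as singletons, which is the kind of cluster growth the modification is meant to avoid.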

As already mentioned, the data are fragmented into various clusters for each
temperature value, and for higher temperatures the number of clusters increases
due to finer and finer segmentation. In order to select the most representative
clusters through all temperature steps, we assign a stability value to each
cluster obtained, based on its evolution. We define $\tau_t$ as the number of temperature
steps until the system reaches the paramagnetic phase and $\tau_v$ as the number of
temperature steps a cluster v survives, while $l_t$ and $l_v$ are defined as the total
number of sites and the number of elements in a given cluster, respectively. We
assign a stability parameter $S_v$ to each cluster, as follows:

$$ S_v = \frac{\mathrm{col}_v \, \mathrm{row}_v}{|\mathrm{col}_v - \mathrm{row}_v| + \epsilon}, \qquad (4) $$

where $\mathrm{col}_v = \tau_v/\tau_t$ is the fraction of temperature steps a cluster v survives, while
$\mathrm{row}_v = l_v/l_t$ is the fraction of total elements belonging to v. The advantage
of using the stability parameter is that it gives preference to clusters that
survive several temperatures, but also have an acceptable number of elements.
We added a small positive real number $\epsilon$ to the denominator in the expression
for $S_v$ to handle the special case $\mathrm{col}_v = \mathrm{row}_v = n$, where n belongs to the range
(0, 1], which leads to $S_v = n^2/\epsilon$ instead of infinity.
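
A short sketch of the stability parameter follows. The function name and the default value of `eps` are assumptions chosen for illustration; the source only states that the constant is small and positive.

```python
def stability(steps_survived, total_steps, cluster_size, total_sites,
              eps=1e-6):
    """Stability S_v = col * row / (|col - row| + eps) of a cluster.

    eps is the small positive constant in the denominator; its default
    value here is an arbitrary choice for the example.
    """
    col = steps_survived / total_steps  # fraction of temperature steps survived
    row = cluster_size / total_sites    # fraction of all sites in the cluster
    return col * row / (abs(col - row) + eps)


# Two clusters of 50 genes out of 1000, followed over 100 temperature steps.
long_lived = stability(10, 100, 50, 1000)
short_lived = stability(2, 100, 50, 1000)

# Special case col = row = n = 0.5, which gives n**2 / eps.
balanced = stability(50, 100, 500, 1000, eps=0.01)
```

For equal size, the cluster that survives more temperature steps receives the higher value, and the balanced case stays finite because of `eps`.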

It has been reported that the main drawback of the SPC algorithm lies
in dealing with data showing regions of different density [18]. In this case,
depending on either the temperature or the number of neighbors selected, some
clusters will easily become prominent whereas the detection of others will be
hindered. To overcome this problem, at least two techniques have been proposed,
e.g., sequential superparamagnetic clustering [19] and a modularity approach
[20]. Our idea is to take advantage of already available biological information to
improve lattice connectivity in such a way that biologically significant clusters
are more likely to be detected by the algorithm.

Indeed, at the transcriptional level, the expression of a gene can be
promoted or suppressed by the binding of proteins called transcription factors to
specific sequences in the gene promoter region. Thus, if a group of genes shows
the same expression behavior in a microarray experiment, it is quite possible
that they are being regulated by a specific transcription factor, forming a group
of coregulated genes [21]. Hence, available information about which genes are
targeted by the same transcription factors may be useful in the detection of
groups of genes with similar expression profiles.

To put this idea into practice, we downloaded from www.yeastract.com a list
of well-documented yeast transcription factors, and whenever two
neighboring genes are controlled by the same transcription factor, we increased
their interaction strength. It is important to note that the list provided by
www.yeastract.com includes transcription factors associated with several
processes, not only those related to the cell cycle. The formula that takes this into
account replaces Eq. (2) of the original algorithm, and has the following form:

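$$ J_{ij} = \begin{cases} \dfrac{F}{K} \exp\left(-\dfrac{d_{ij}^2}{2a^2}\right) & \text{if } i \text{ and } j \text{ are neighbors,} \\ 0 & \text{otherwise.} \end{cases} \qquad (5) $$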

Here, F = fn is the number n of common transcription factors shared by i and
j (which varies for each pair of neighboring genes), multiplied by a factor f,
which was chosen to be 2.0 after comparing the results obtained with several
other values. The selected value has the characteristic of preserving well-defined
susceptibility peaks as well as yielding larger clusters. The objective is to
strengthen some connections without preventing the natural fragmentation of
clusters caused by the temperature parameter. If two elements do not share a
transcription factor, then F = 1, recovering the original SPC formula. Therefore,
the modified interaction strength between each site and its neighbors is
governed by two aspects: the distance between them, which comes from gene
expression values generated through microarray experiments, and the number
of transcription factors regulating both genes, obtained from documented
biological data. Any time two genes share a transcription factor, their interaction
strength becomes larger, which helps the clusters including these sites
remain stable over longer temperature ranges, with a corresponding increase in
their stability values.
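
The weighting can be sketched as follows, under one explicit assumption: F is taken to enter the interaction strength as a multiplicative prefactor of the Gaussian SPC term, which is consistent with F = 1 recovering the original formula but is not confirmed by the text. The function names and the regulator labels are hypothetical.

```python
import math


def tf_weight(regulators_i, regulators_j, f=2.0):
    """Weight F = f * n for n shared transcription factors, or 1 if none are shared."""
    n_shared = len(set(regulators_i) & set(regulators_j))
    return f * n_shared if n_shared > 0 else 1.0


def spctf_coupling(distance, avg_neighbors, avg_distance, weight):
    """Gaussian SPC coupling scaled by the weight F.

    Assumption of this sketch: F acts as a multiplicative prefactor.
    """
    gaussian = math.exp(-distance ** 2 / (2 * avg_distance ** 2))
    return weight * gaussian / avg_neighbors


# Hypothetical regulator sets for three neighboring genes.
one_shared = tf_weight({"TF_A", "TF_B"}, {"TF_B", "TF_C"})
two_shared = tf_weight({"TF_A", "TF_B"}, {"TF_A", "TF_B"})
none_shared = tf_weight({"TF_A"}, {"TF_C"})

plain = spctf_coupling(1.0, avg_neighbors=8.0, avg_distance=1.0, weight=none_shared)
boosted = spctf_coupling(1.0, avg_neighbors=8.0, avg_distance=1.0, weight=one_shared)
```

With f = 2.0, a single shared transcription factor doubles the coupling of an otherwise identical pair, while the Gaussian factor still suppresses links between distant genes.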



3. Results and Discussion 

We analyzed the Spellman et al. [22] microarray data, in which gene expression
values from synchronized yeast cultures were obtained at various time points,
aiming to identify cell cycle genes. Yeast cultures were synchronized by three
methods: adding alpha pheromone, which arrests cells in the G1 phase; using
centrifugal elutriation for separating small G1 cells; and using a mutation that
arrests cells late in mitosis at a given temperature. Combining the three
experiments and using Fourier and correlation algorithms, Spellman et al. [22]
reported 800 cell cycle regulated genes.

The goal was to compare the performance of SPC and SPC with
transcription factors (SPCTF), which are algorithms that do not make
assumptions about periodicity. Since the overall analysis is time consuming,
we selected only the data set treated with the alpha pheromone, available at
http://cellcycle-www.stanford.edu. Genes with missing values were discarded,
leaving an input matrix of 4489 genes and 18 time points that included only
613 of the genes reported by Spellman et al. [22]. Furthermore, as we do not
include the other two synchronization experiments, we expect to lose some of
their cell cycle genes.
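
The filtering step can be illustrated with a minimal sketch. The data layout (a dictionary of profiles with `None` marking a missing measurement) and the gene labels are assumptions made for the example, not the format of the downloaded file.

```python
def drop_incomplete_genes(expression):
    """Keep only genes whose expression profile has no missing value.

    expression -- dict mapping a gene name to its list of measurements,
                  with None marking a missing value (assumed encoding)
    """
    return {gene: values for gene, values in expression.items()
            if all(v is not None for v in values)}


# Hypothetical three-gene, four-time-point input.
raw = {
    "geneA": [0.1, -0.3, 0.8, 0.2],
    "geneB": [0.5, None, 0.4, 0.1],
    "geneC": [-0.2, -0.1, 0.0, 0.3],
}
complete = drop_incomplete_genes(raw)
```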

It is worth mentioning that Getz et al. [23] also analyzed the Spellman alpha
synchronized set with the SPC algorithm. They took 2467 genes with
characterized functions and introduced a Fourier transform to take into account
the oscillatory nature of the cell cycle. In our case, however, we decided not to
introduce any considerations about the periodicity of the data, mainly because
the time series cover only two cell cycle periods [24].

We obtain compact gene clusters by running both the original SPC algorithm and
SPCTF with parameter values k = 8 and q = 20. The cluster with the
highest stability value contains an extremely large number of elements without
a clear biological linkage between them. It is mainly composed of genes whose
expression does not change significantly over time, so it is possible that they are
grouped together for this very reason. We discard this cluster from our analysis,
although it could always be taken apart and analyzed again with SPCTF by
choosing the appropriate number of neighbors to obtain more information.

To compare both approaches in more detail, it is necessary to match each
cluster in the SPC method with its equivalent in SPCTF. In order to do this,
we calculate the Euclidean distance between the mean position vectors of every
pair of clusters, one from each approach, and choose the pairs with the shortest
distance between them. (We recall that the mean position vector of a cluster is obtained
by averaging each coordinate over all its elements.) Although different
measures could have been used, this one performed adequately, as can be seen in the
supplementary information file, where we provide a more detailed comparison
between SPCTF and SPC clusters. In Table 1 we present the differences in
cluster size as well as the hits, i.e., the number of genes reported by Spellman et al.
[22] that have been included in the clusters. With the SPCTF
approach, the largest cluster loses some genes, while the
number of clusters in every other size category increases. Moreover, hits or coincidences with
the Spellman et al. [22] cell cycle genes in clusters of six or more elements increase
by 61%, from 108 to 174. Therefore, we were able to incorporate several genes
into these clusters, mainly from outliers.
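
The matching of clusters through their mean position vectors can be sketched as below. The function names and cluster labels are hypothetical, and matching each cluster independently to its nearest counterpart (so that two clusters could in principle share a partner) is a simplification of this sketch.

```python
import math


def centroid(profiles):
    """Mean position vector: average of each coordinate over all cluster members."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def match_clusters(clusters_a, clusters_b):
    """For each cluster in clusters_a, find the closest cluster in clusters_b.

    Both arguments map a cluster label to the list of expression profiles of
    its members. Returns a dict {label_a: label_b}.
    """
    centers_b = {label: centroid(p) for label, p in clusters_b.items()}
    matches = {}
    for label_a, profiles in clusters_a.items():
        center_a = centroid(profiles)
        matches[label_a] = min(
            centers_b, key=lambda lb: math.dist(center_a, centers_b[lb]))
    return matches


# Toy two-condition profiles; the second method's clusters gained one member each.
spc = {"A": [[0.0, 0.0], [0.2, 0.0]],
       "B": [[5.0, 5.0], [5.0, 5.4]]}
spctf = {"X": [[5.0, 5.0], [5.0, 5.4], [4.7, 5.2]],
         "Y": [[0.0, 0.0], [0.2, 0.0], [0.1, 0.3]]}
pairs = match_clusters(spc, spctf)
```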

Comparison between SPC and SPCTF

                   SPC                        SPCTF
Cluster size       Clusters (genes)   Hits    Clusters (genes)   Hits
First cluster      1 (2078)           68      1 (1657)           64
Size >= 6          19 (220)           108     27 (359)           174
Size = 5           5 (25)             2       13 (65)            23
Size = 4           23 (92)            11      32 (128)           22
Size = 3           57 (171)           39      61 (183)           30
Size = 2           144 (288)          49      187 (374)          60
Size = 1           1615               336     1723               240
Total clusters     1864                       2044
Total genes        4489                       4489
Total hits                            613                        613

Table 1: Number of clusters for each cluster size. The total number of genes for each
cluster size appears in parentheses, and their hits with Spellman et al. [22] appear in the
adjacent Hits column. Hits with the 613 cell cycle genes reported by Spellman et al. [22] increase for clusters
of size 6 and bigger, while decreasing in the first cluster and outliers.



In the following analysis, we focus on clusters of six or more elements,
because we are interested in finding groups of several genes sharing the same
expression pattern (coregulated genes). Results of the comparison for the first
27 most stable clusters, discarding the first one, are shown in Fig. 1.
Generally, these clusters incorporate more elements with SPCTF, including more cell
cycle genes reported by Spellman et al. [22], thus improving the
matching.

Depending on the available information about the genes, we classify the
clusters into three groups. The first cluster type, cell cycle genes (CC), corresponds
to groups formed in their majority (> 85%) by already reported cell cycle genes
(Fig. 2). The second type, mixed genes (M), contains clusters with non-reported
genes as well as already known cell cycle genes (Fig. 3), and in the third type,
no hits (N), we include the clusters that contain only one hit or are entirely
composed of genes not previously identified as cell cycle genes (Fig. 3).
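
The three cluster types can be expressed as a small decision rule. This is a sketch of one reading of the criteria: the order of the checks, the function name and the gene labels are assumptions, while the 85% threshold and the "at most one hit" condition come from the text.

```python
def classify_cluster(genes, known_cell_cycle, cc_fraction=0.85):
    """Label a cluster as 'CC', 'M' or 'N' from its hits with known genes.

    CC -- more than cc_fraction of the members are known cell cycle genes
    N  -- at most one member is a known cell cycle gene
    M  -- everything in between
    """
    hits = sum(1 for g in genes if g in known_cell_cycle)
    if hits / len(genes) > cc_fraction:
        return "CC"
    if hits <= 1:
        return "N"
    return "M"


# Hypothetical labels: g1..g9 are known cell cycle genes, u1..u6 are not.
known = {f"g{i}" for i in range(1, 10)}
label_cc = classify_cluster([f"g{i}" for i in range(1, 10)] + ["u1"], known)
label_m = classify_cluster(["g1", "g2", "g3", "u1", "u2", "u3"], known)
label_n = classify_cluster(["g1", "u1", "u2", "u3", "u4", "u5"], known)
```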

It is worth mentioning that more cell cycle experiments have been done since
Spellman et al. [22], and new genes have meanwhile been classified as cell cycle
regulated. Some of these newly reported cell cycle genes were obtained by Cho
et al. [25], Pramila et al. [26], Rowicka et al. [27] and Lichtenberg et al. [28].
We analyze our 27 clusters, now taking as hits the genes reported either by Spellman
et al. [22] or by one of the above-mentioned studies. In this way, we gain
thirty additional hits in the SPC clusters, while in the SPCTF clusters we have
fifty-two extra genes. The results including all the aforementioned cell cycle
studies are presented in Figs. 4-6 [29].

In addition, we analyze the expression profiles of the genes making up each
cluster using the SCEPTRANS tool [30], and we notice that all the genes
Figure 1: General comparison of the first 27 clusters, discarding the first one. Gray bars
correspond to the clusters obtained with the SPC algorithm and black bars to the equivalent
clusters in SPCTF. Groups tend to increase in size and also in hits with cell cycle genes
reported by Spellman et al. [22], with the exception of cluster 11.

Figure 2: Comparison between the SPC and SPCTF results, showing the CC clusters. Gray
bars correspond to the clusters obtained with the SPC algorithm and black bars to the
equivalent clusters in SPCTF.

Figure 3: M and N clusters, left and right respectively. Gray bars correspond to the clusters
obtained with the SPC algorithm and black bars to the equivalent clusters in SPCTF.

Figure 4: General comparison of the first 27 most stable clusters. Hits are now taken as cell
cycle genes reported by all studies. Gray bars correspond to the clusters obtained with the
SPC algorithm and black bars to the equivalent clusters in SPCTF.

Figure 5: Comparison between SPC and SPCTF results, showing CC clusters. Gray bars
correspond to the clusters obtained with the SPC algorithm and black bars to the equivalent
clusters in SPCTF.

Figure 6: M and N clusters, left and right respectively. Gray bars correspond to the clusters
obtained with the SPC algorithm and black bars to the equivalent clusters in SPCTF.
grouped in the same cluster had the same expression pattern. This gives us
further confidence that our algorithm is grouping data correctly. The expression
profiles for a representative member of each cluster type are shown in Fig.
7. We also find two clusters (21 and 27) that present an oscillating behavior
that is due to an artifact in the way the microarray experiment was
performed; see [31, 32]. In the supplementary information file, we include the list
of oscillating genes identified in [31] and the number of these genes inside each
of our first 27 clusters. We also include the expression profiles of these clusters
as well as those of size 5 and 4 that contain hits with cell cycle genes identified
by Spellman et al. [22]. These clusters also have similar expression profiles
but were not further analyzed because of their low number of elements. In the
case of gene annotation, it is important to have clusters of many elements to
effectively ensure that an unknown gene shares the biological function already
assigned to the other genes in the same cluster.

The CC clusters are almost entirely composed of cell cycle regulated genes
reported either by Spellman et al. [22] or by other authors; moreover, their
expression patterns are similar, which leaves no doubt about their validity. For the
M and N clusters, we know that they are well grouped because their elements
share the same expression patterns, but in order to select those worthy of
further analysis (for example in a laboratory experiment) we analyze them with
MUSA (motif finding using an unsupervised approach), an algorithm that can be
found at www.yeastract.com. This program searches for the most common
sequences (motifs) in the regulatory region of a set of genes, and compares them
to the transcription factor binding sites already described in the YEASTRACT database
[33, 34]. Results of this analysis are shown in Table 2, which includes the
quorum, or percentage of genes containing a motif in each cluster, and the alignment
score, which quantifies the level of similarity between the encountered motif and
the known transcription factor associated with it. The clusters that would probably
give the best results are those associated with cell cycle
transcription factors with high percentages and scores. In this way, we select
clusters 1, 5, 9, 12, 16 and 24 because they have percentages higher than 70%
and scores higher than 80%.
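
The two thresholds can be applied with a short filter. One assumption is made loudly here: the alignment score written as a ratio such as (5/6) in Table 2 is read as a percentage of the motif length, which the text does not state explicitly. The function name is hypothetical.

```python
def passes_musa_filter(aligned, motif_length, quorum_percent,
                       min_score=80.0, min_quorum=70.0):
    """True if a motif match exceeds both the score and the quorum thresholds.

    aligned / motif_length is read here as the alignment score written as
    "(5/6)" in Table 2; treating that ratio as a percentage is an assumption.
    """
    score_percent = 100.0 * aligned / motif_length
    return score_percent > min_score and quorum_percent > min_quorum


# A match with score 5/6 and quorum 71.43 %, as listed for cluster 16.
kept = passes_musa_filter(5, 6, 71.43)
# Hypothetical matches that fail one of the two thresholds.
low_score = passes_musa_filter(4, 6, 90.0)
low_quorum = passes_musa_filter(6, 7, 60.0)
```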

In order to validate the MUSA analysis, we also constructed various clusters
with sizes ranging from six to thirty-seven genes that were composed of genes
selected at random from the original data. When analyzing these random
clusters in the same way with MUSA, we obtain at most two cell cycle transcription
factor coincidences.
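
Random control clusters of this kind can be drawn as in the sketch below. The function name, the seed and the placeholder gene labels are assumptions for illustration; only the size range (six to thirty-seven genes) comes from the text.

```python
import random


def random_clusters(gene_names, sizes, seed=0):
    """Draw one random gene set per requested size, without repeats within a set."""
    rng = random.Random(seed)  # fixed seed so the control sets are reproducible
    return [rng.sample(gene_names, size) for size in sizes]


# Placeholder gene labels standing in for the rows of the input matrix.
genes = [f"gene{i}" for i in range(100)]
controls = random_clusters(genes, sizes=[6, 20, 37])
```

Each control set would then be submitted to the same motif analysis as the real clusters.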



4. Summary and conclusions 

Large amounts of biological information are constantly obtained by high-throughput
techniques, and clustering algorithms have taken an important place in the
unraveling of this information. However, clustering analyses pose a difficult
challenge because any data set can be grouped in numerous ways, depending on
the level of resolution asked for and the applied similarity measure. In this work,
we propose the use of available biological information in order to strengthen the
Figure 7: (Color online) Expression profiles for a representative member of each cluster type
using the SCEPTRANS tool. Expression profiles for all clusters are available in the
supplementary information.
MUSA analysis

[Rows for the clusters preceding cluster 10 are illegible in the source.]

Cluster  Type  Transcription factors (alignment score)            Quorum
10       N     Low scores
11       N     Azf1p (6/7)
12       N     Azf1p (7/8)                                        88.89 %
               Rfx1p, Cup2p (5/6)                                 77.78 %
14       M     Azf1p (7/8)                                        75 %
15       M     Low scores                                         Low percentages
16       N     Mcm1p (5.25/6), Crz1p (5/6)                        100 %
               Hap1p (5/6)                                        71.43 %
17       N     Arg81p, Upc2p, Sip4p, Rox1p, Crz1p, Zap1p (5/6)    100 %
               Pdr8p (5.33/6)                                     87.5 %
18       N     Azf1p, Zap1p (6/7)                                 100 %
20       N     Low scores
21       N     Low scores
22       N                                                        Low percentages
23       M     Low scores
24       M     Hap1p (6/6), Ecm22p, Upc2p (5/6)                   100 %
               Rfx1p (6/7)                                        83.33 %
25       M     Hac1p (6/7)                                        83.33 %
26       N     Dal80p, Gat1p, Gln3p, Gzf3p (6/7)                  83.33 %
27       N     Ino4p (6.5/7), Ino2p (6/7)                         100 %

Table 2: Results for quorum higher than 70% and scores higher than 80%. Transcription
factors associated with the cell cycle are shown in bold. The most confident clusters are taken as
those that include a cell cycle transcription factor.



interaction between genes that share a transcription factor involved in any
metabolic process, improving the similarity measure. This information is
introduced in the natural evolution of the SPC algorithm, and in this way we are able
to enhance the creation and endurance of groups of possibly coregulated genes.
As the network spanned by the transcription factor information connects all
genes, clustering directly a posteriori using only this information results, in the present
case, in a single massive cluster (see Section IV of the Supplementary
Information). However, because the distance carries an important weight in
the interaction formula, clusters located far apart will not join, despite sharing
transcription factors between their genes.

With this in mind, we have modified the SPC algorithm, and applied both
the original SPC and the modified SPCTF algorithm to one of the three Spellman et al.
[22] data sets of the yeast cell cycle. The expression profiles of the genes in all
resulting clusters show a similar behavior, but we obtain larger clusters with
SPCTF. We classified them into three types, CC, M, and N, depending on the
number of reported cell cycle elements inside each cluster. With SPCTF, the CC
type clusters increase in size, including more cell cycle genes, and for the M and N
type clusters, we also looked for common sequences in their regulatory regions and
selected various groups worthy of further research in order to report possible new
cell cycle genes. As expected, some of these clusters include already known cell
cycle genes sharing a transcription factor, but more importantly, at the predictive
level, they promote the inclusion of new genes with similar expression patterns.
It is also important to note that the modified algorithm can be applied to any
data set, and the methodology followed leads to the selection of potential
gene subsets suitable for experimental investigation. Our work can serve
as an example of how the inclusion of available biological information, such
as transcription factors, and bioinformatic tools, such as MUSA, can lead to
better and more reliable results, aiding in the analysis of data coming from
microarray experiments.